We would like to simulate metagenomic data. To begin with, we will make a number of simplifying assumptions to keep things easy.
n_organism <- 100
n_bp <- 10^3
k <- 40
n_kmer <- n_bp - k # strictly n_bp - k + 1 distinct k-mers; rounded for convenience
time_window <- 14
baseline <- 100
r <- 1
read_depth <- 10^6
Suppose that there are 100 organisms in total (and nothing else exists). Each organism has a genome that is exactly 1000 base-pairs long. Each genome is broken into \(k\)-mers of length 40. Strictly, a sequence of length \(n\) contains \(n - k + 1\) substrings of length \(k\) (here 961); for round numbers we use \(n - k = 960\) \(k\)-mers per organism. Assume that each \(k\)-mer is unique, and that there are no sequencing errors.
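As a toy illustration (not part of the simulation, and using a made-up sequence), here is how the \(k\)-mers of a short sequence can be enumerated in R, showing where the \(n - k + 1\) count comes from:

```r
# Hypothetical 10 bp sequence with k-mers of length 4
toy_seq <- "ACGTACGTAC"
toy_k <- 4
# A length-n sequence has n - k + 1 substrings of length k
toy_kmers <- substring(toy_seq, 1:(nchar(toy_seq) - toy_k + 1), toy_k:nchar(toy_seq))
toy_kmers
length(toy_kmers) # 7, i.e. 10 - 4 + 1
```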
We collect data over a period of 14 days, \(t = 0, 1, \dots, 13\), at a single environmental monitoring site. Every species has a baseline concentration \(c^{(b)}\) in water of 100 copies per \(\mu\)L¹. Suppose that one of the species is experiencing exponential growth: it starts at the baseline concentration, but over the 14-day window its concentration increases according to \[ c^{(e)}_t = c^{(e)}_0 \exp(rt) \] where \(r\) is the growth rate, which we take to be 1.
#' The concentration of the baseline organisms over the time window
conc_baseline <- rep(baseline, time_window)
conc_baseline
## [1] 100 100 100 100 100 100 100 100 100 100 100 100 100 100
#' The concentration of the exponentially increasing organism over the time window
conc_exponential <- baseline * exp(r * (0:(time_window - 1)))
conc_exponential
## [1] 1.000000e+02 2.718282e+02 7.389056e+02 2.008554e+03 5.459815e+03 1.484132e+04 4.034288e+04 1.096633e+05 2.980958e+05
## [10] 8.103084e+05 2.202647e+06 5.987414e+06 1.627548e+07 4.424134e+07
We will represent the true concentrations of each organism with a matrix C, placing the exponentially increasing organism in the first column.
C <- matrix(
data = c(conc_exponential, rep(conc_baseline, n_organism - 1)),
nrow = time_window,
ncol = n_organism
)
C[1:14, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000e+02 100 100 100 100
## [2,] 2.718282e+02 100 100 100 100
## [3,] 7.389056e+02 100 100 100 100
## [4,] 2.008554e+03 100 100 100 100
## [5,] 5.459815e+03 100 100 100 100
## [6,] 1.484132e+04 100 100 100 100
## [7,] 4.034288e+04 100 100 100 100
## [8,] 1.096633e+05 100 100 100 100
## [9,] 2.980958e+05 100 100 100 100
## [10,] 8.103084e+05 100 100 100 100
## [11,] 2.202647e+06 100 100 100 100
## [12,] 5.987414e+06 100 100 100 100
## [13,] 1.627548e+07 100 100 100 100
## [14,] 4.424134e+07 100 100 100 100
The true concentration of each \(k\)-mer is that of the corresponding organism. We can represent this by copying each column of the matrix C 960 times, once for each of that organism's \(k\)-mers.
K <- matrix(rep(as.numeric(t(C)), each = n_kmer), nrow = time_window, byrow = TRUE)
The matrix K now has 14 rows, one for each day, and 96000 columns, one for each \(k\)-mer (960 \(k\)-mers times 100 organisms).
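As a sanity check (a sketch reusing the objects defined above), the same matrix can be built by repeating each column index of C directly, and the two constructions should agree:

```r
# Equivalent construction: repeat each column of C n_kmer times
K_alt <- C[, rep(1:n_organism, each = n_kmer)]
all.equal(K, K_alt) # TRUE
dim(K)              # 14 rows (days) by 96000 columns (k-mers)
```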
We can calculate proportions, where the total number of \(k\)-mers is given by the row sums:
K_norm <- t(apply(K, 1, function(x) x / sum(x)))
useful::topleft(K_norm)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.041667e-05 1.041667e-05 1.041667e-05 1.041667e-05 1.041667e-05
## [2,] 2.783712e-05 2.783712e-05 2.783712e-05 2.783712e-05 2.783712e-05
## [3,] 7.234704e-05 7.234704e-05 7.234704e-05 7.234704e-05 7.234704e-05
## [4,] 1.756925e-04 1.756925e-04 1.756925e-04 1.756925e-04 1.756925e-04
## [5,] 3.702719e-04 3.702719e-04 3.702719e-04 3.702719e-04 3.702719e-04
useful::bottomleft(K_norm)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.001029094 0.001029094 0.001029094 0.001029094 0.001029094
## [2,] 0.001037006 0.001037006 0.001037006 0.001037006 0.001037006
## [3,] 0.001039947 0.001039947 0.001039947 0.001039947 0.001039947
## [4,] 0.001041033 0.001041033 0.001041033 0.001041033 0.001041033
## [5,] 0.001041434 0.001041434 0.001041434 0.001041434 0.001041434
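A quick check (using the matrices defined above) that each row of K_norm is a valid probability vector, and that the normalisation is equivalent to dividing by row sums directly:

```r
range(rowSums(K_norm)) # both ends should be 1
# The length-14 vector rowSums(K) recycles down each column,
# so this divides row i of K by its sum
all.equal(K_norm, K / rowSums(K)) # TRUE
```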
Suppose that the read depth is \(10^6\). We take a multinomial sample with the probability of sampling each \(k\)-mer given by K_norm.
#' The proportions from day 1
K_norm_one <- K_norm[1, ]
sample_one <- rmultinom(1, read_depth, K_norm_one)
hist(sample_one)
#' Try to do all of the samples with an apply
sample <- apply(K_norm, 1, function(row) rmultinom(1, read_depth, row))
colnames(sample) <- paste0("day", 1:ncol(sample))
rownames(sample) <- paste0(1:nrow(sample))
sample_df <- sample %>%
as.data.frame() %>%
tibble::rownames_to_column("id") %>%
pivot_longer(
cols = starts_with("day"),
names_to = "day",
values_to = "count",
names_prefix = "day"
) %>%
mutate(
id = as.numeric(id),
day = as.numeric(day),
kmer = rep(rep(1:n_kmer, each = time_window), times = n_organism),
organism = rep(1:n_organism, each = n_kmer * time_window)
)
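Each day's multinomial sample should total the read depth; a quick check on both the matrix and the long data frame (assuming the objects defined above):

```r
# Every column (day) of the sample matrix sums to the read depth
all(colSums(sample) == read_depth) # TRUE
# The same check on the long data frame
sample_df %>%
  group_by(day) %>%
  summarise(total = sum(count)) %>%
  pull(total) %>%
  unique() # 1e+06
```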
Just plotting the data from the organism which we have set to be exponentially increasing:
sample_summary <- sample_df %>%
filter(organism == 1) %>%
group_by(day) %>%
summarise(
count_upper = quantile(count, 0.95),
count_median = median(count),
count_lower = quantile(count, 0.05)
)
ggplot(sample_summary, aes(x = day, ymin = count_lower, y = count_median, ymax = count_upper)) +
geom_ribbon(alpha = 0.1) +
geom_point() +
geom_point(data = filter(sample_df, organism == 1), aes(x = day, y = count, col = kmer),
alpha = 0.01, inherit.aes = FALSE) +
theme_minimal() +
labs(x = "Day", y = "Number of reads in sample", col = "kmer")
pickout_day <- 10
With these settings, on day 10 the median count of organism 1 is 1030 reads per \(k\)-mer. At this stage the exponential growth curve has started to level off, because \(k\)-mers from organism 1 already represent about 98.8% of the total genetic material. This is unrealistic: it would be unlikely, and for the novel pathogens we are considering essentially impossible, for any one organism to saturate the sample like this.
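The saturation figure can be computed directly from the concentrations (a check using the objects defined above; day 10 corresponds to index 10, i.e. \(t = 9\)):

```r
# Proportion of all genetic material contributed by organism 1 on day 10
prop_org1 <- conc_exponential[pickout_day] /
  (conc_exponential[pickout_day] + (n_organism - 1) * baseline)
prop_org1                       # roughly 0.988
# Expected reads per organism-1 k-mer, matching the median count above
read_depth * prop_org1 / n_kmer # roughly 1029
```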
Suppose now that the organisms have abundances which vary stochastically, rather than being either fixed or deterministically exponentially increasing.
For the baseline concentration, we will use a lognormal geometric random walk such that \[ c^{(b)}_t = c^{(b)}_{t - 1} \times \exp \left( \epsilon^{(b)}_t \right) \] where \(\epsilon^{(b)}_t \sim \mathcal{N}(0, \sigma_b^2)\). Another way to calculate \(c^{(b)}_t\) is as \[ c^{(b)}_t = c^{(b)}_{0} \times \exp \left( \sum_{s = 1}^{t} \epsilon^{(b)}_s \right) \] Note that the expected cumulative sum of innovations is always zero, so the median concentration at any time remains \(c^{(b)}_{0}\) (by Jensen's inequality, the mean concentration is larger).
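A quick Monte Carlo sketch of this point (with \(\sigma_b = 1\), the value implicitly used in the simulations below): over many paths the median of \(c^{(b)}_t\) stays at the baseline, while the mean drifts upward.

```r
set.seed(1)
n_paths <- 10^4
# Standard normal innovations over the full window, one row per path
eps <- matrix(rnorm(n_paths * time_window), nrow = n_paths)
c_T <- baseline * exp(rowSums(eps)) # concentration at the end of the window
median(c_T) # close to baseline = 100
mean(c_T)   # much larger, since E[c_T] = baseline * exp(time_window / 2)
```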
n_baseline_paths <- 8
To show how this behaviour looks, let’s simulate 8 baseline paths.
baseline_df <- data.frame(
day = rep(1:time_window, times = n_baseline_paths),
epsilon = rnorm(time_window * n_baseline_paths), # implicitly sigma_b = 1
path_number = as.factor(rep(1:n_baseline_paths, each = time_window))
) %>%
group_by(path_number) %>%
arrange(path_number) %>%
mutate(
cumsum_epsilon = cumsum(epsilon),
conc = baseline * exp(cumsum_epsilon)
) %>%
ungroup()
cbpalette <- c("#56B4E9", "#009E73", "#E69F00", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#999999")
baseline_df %>%
pivot_longer(
cols = c("epsilon", "cumsum_epsilon", "conc"),
names_to = "variable",
values_to = "value"
) %>%
mutate(
variable = recode_factor(variable,
"epsilon" = "epsilon",
"cumsum_epsilon" = "cumsum(epsilon)",
"conc" = "Concentration")
) %>%
ggplot(aes(x = day, y = value, group = path_number, col = path_number)) +
geom_line() +
facet_wrap(~variable, ncol = 1, scales = "free") +
scale_colour_manual(values = cbpalette) +
theme_minimal() +
labs(x = "Day", y = "", col = "Path number", title = "Baseline behaviour")
The key takeaway for me is that even though the noise is IID and Gaussian, taking the cumulative sum and exponentiating can lead to large deviations in concentration. For example, the maximum concentration value observed was about \(4.19 \times 10^4\), roughly 419 times the baseline concentration.
For the exponential regime, we also suppose a geometric lognormal random walk \[ c^{(e)}_t = c^{(e)}_{t - 1} \times \exp \left( \epsilon^{(e)}_t \right) \] where \(\epsilon^{(e)}_t \sim \mathcal{N}(r, \sigma_e^2)\) and the growth rate \(r > 0\). The median concentration at time \(t\) is \(c^{(e)}_{0} \times \exp(rt)\). Let’s simulate some paths from this distribution as before.
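As a quick aside, the same Monte Carlo check as for the baseline regime (again with \(\sigma_e = 1\) assumed) confirms that the median path tracks the deterministic exponential curve:

```r
set.seed(1)
# Innovations now have mean r, one row per path
eps <- matrix(rnorm(10^4 * time_window, mean = r), nrow = 10^4)
c_T <- baseline * exp(rowSums(eps))
median(c_T)                     # close to the deterministic value below
baseline * exp(r * time_window) # the median concentration at t = 14
```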
exponential_df <- data.frame(
day = rep(1:time_window, times = n_baseline_paths),
epsilon = rnorm(time_window * n_baseline_paths, mean = r), # implicitly sigma_e = 1
path_number = as.factor(rep(1:n_baseline_paths, each = time_window))
) %>%
group_by(path_number) %>%
arrange(path_number) %>%
mutate(
cumsum_epsilon = cumsum(epsilon),
conc = baseline * exp(cumsum_epsilon)
) %>%
ungroup()
exponential_df %>%
pivot_longer(
cols = c("epsilon", "cumsum_epsilon", "conc"),
names_to = "variable",
values_to = "value"
) %>%
mutate(
variable = recode_factor(variable,
"epsilon" = "epsilon",
"cumsum_epsilon" = "cumsum(epsilon)",
"conc" = "Concentration")
) %>%
ggplot(aes(x = day, y = value, group = path_number, col = path_number)) +
geom_line() +
facet_wrap(~variable, ncol = 1, scales = "free") +
scale_colour_manual(values = cbpalette) +
theme_minimal() +
labs(x = "Day", y = "", col = "Path number", title = "Exponential behaviour")
¹ The units don’t matter much.